Abstract

We use a Dynamic Bayesian Network (DBN) to represent compactly
a variety of sublexical and contextual features relevant to Part-of- 
Speech (PoS) tagging. The outcome is a flexible tagger (LegoTag) 
with state-of-the-art performance (3.6% error on a benchmark cor- 
pus). We explore the effect of eliminating redundancy and radically 
reducing the size of feature vocabularies. We find that a small but lin- 
guistically motivated set of suffixes results in improved cross-corpora 
generalization. We also show that a minimal lexicon limited to func- 
tion words is sufficient to ensure reasonable performance. 

1 Part-of-Speech Tagging 

Many NLP applications face the dilemma of whether to use statistically
extracted or expert-selected features. There are good arguments in
support of either view. Statistical feature selection does not require exten- 
sive use of human domain knowledge, while feature sets chosen by experts 
are more economical and generalize better to novel data. 

Most currently available PoS taggers perform with a high degree of ac- 
curacy. However, it appears that the success in performance can be over- 
whelmingly attributed to an across-the-board lexicalization of the task. In- 
deed, Charniak, Hendrickson, Jacobson & Perkowitz (1993) note that a 
simple strategy of picking the most likely tag for each word in a text leads 
to 90% accuracy. If so, it is not surprising that taggers using vocabulary 
lists, with the number of entries ranging from 20k to 45k, perform well. Even
though a unigram model achieves an overall accuracy of 90%, it relies heav- 
ily on lexical information and is next to useless on nonstandard texts that 
contain lots of domain-specific terminology. 
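
The most-likely-tag strategy mentioned above can be sketched in a few lines.
This is only an illustration: the function names and the toy sentences are
invented here, and falling back to the globally most frequent tag for unknown
words is an assumption, not something the cited work specifies.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Keep the most frequent tag for each word seen in training."""
    counts = defaultdict(Counter)
    tag_totals = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
            tag_totals[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # Assumed fallback for unknown words: the most frequent tag overall.
    default_tag = tag_totals.most_common(1)[0][0]
    return lexicon, default_tag

def tag_unigram(words, lexicon, default_tag):
    """Tag each word independently, ignoring all context."""
    return [lexicon.get(w, default_tag) for w in words]

# Toy data, made up for illustration (not drawn from any corpus).
toy_train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("dog", "NN"), ("food", "NN"), ("smells", "VBZ")],
]
lexicon, default_tag = train_unigram(toy_train)
print(tag_unigram(["the", "dog", "meows"], lexicon, default_tag))
```

The unseen word "meows" receives the fallback tag, which shows why such a
model degrades on text whose vocabulary differs from the training corpus.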

The lexicalization of the PoS tagging task comes at a price. Since word 
lists are assembled from the training corpus, they hamper generalization 
across corpora. In our experience, taggers trained on the Wall Street Journal
(WSJ) perform poorly on novel text such as email or newsgroup messages
(a.k.a. Netlingo). At the same time, alternative training data are scarce
and expensive to create. This paper explores an alternative to lexicalization.
Using linguistic knowledge, we construct a minimalist tagger with a small 
but efficient feature set, which maintains a reasonable performance across 
corpora. 

A look at the previous work on this task reveals that the unigram model 
is at the core of even the most sophisticated taggers. The best-known rule- 
based tagger (Brill 1994) works in two stages: it assigns the most likely 
tag to each word in the text; then, it applies transformation rules of the 
form "Replace tag X by tag Y in triggering environment Z". The triggering
environments span up to three sequential tokens in each direction and refer 
to words, tags or properties of words within the region. The Brill tagger 
achieves less than 3.5% error on the Wall Street Journal (WSJ) corpus.
However, its performance depends on a comprehensive vocabulary (70k words).
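
A transformation rule of the kind quoted above might be applied as in the
following sketch. It covers only one hypothetical trigger type (the tag of the
preceding token) and tests the trigger against the input tags rather than the
partially rewritten output; both choices are simplifying assumptions, and
`apply_rule` is a name invented for this illustration.

```python
def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Replace from_tag by to_tag wherever the preceding tag is prev_tag."""
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

# Initial tags as a most-likely-tag pass might assign them (toy example).
initial = ["TO", "NN", "DT", "NN"]
print(apply_rule(initial, from_tag="NN", to_tag="VB", prev_tag="TO"))
```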

Statistical tagging is a classic application of Markov Models (MMs).
Brants (2000) argues that second-order MMs can also achieve state-of-the-art
accuracy, provided they are supplemented by smoothing techniques and
mechanisms to handle unknown words. His Trigrams 'n Tags (TnT) tagger
handles unknown words by estimating the tag probability given the suffix of
the unknown word and its capitalization. The reported 3.3% error for TnT
on the WSJ (trained on $10^6$ words and tested on $10^4$) appears to be a
result of overfitting. Indeed, this is the maximum performance obtained by
training TnT until only 2.9% of words are unknown in the test corpus. A
simple examination of WSJ shows that such a percentage of unknown words in
the testing section (10% of the WSJ corpus) requires building an unreasonably
large lexicon of nearly all (about 44k) words seen in the training section
(90% of WSJ), thus ignoring the danger of overfitting. Hidden MMs (HMMs) are
trained on a dictionary with information about the possible PoS of words
(Jelinek 1985; Kupiec 1992). This means HMM taggers also rely heavily on
lexical information.

Obviously, PoS tags depend on a variety of sublexical features, as well 
as on the likelihood of tag/tag and tag/word sequences. In general, all 
existing taggers have incorporated such information to some degree. The 
Conditional Random Fields (CRF) model (Lafferty, McCallum & Pereira
2001) outperforms the HMM tagger on unknown words by relying extensively
on orthographic and morphological features. It checks whether the first
character of a word is capitalized or numeric; it also registers the presence 
of a hyphen and morphologically relevant suffixes (-ed, -ly, -s, -ion, -tion, 
-ity, -ies). The authors note that CRF-based taggers are potentially flexible 
because they can be combined with feature induction algorithms. However,
training is complex (AdaBoost + forward-backward) and slow ($10^3$
iterations with an optimized initial parameter vector; it fails to converge
with unbiased initial conditions). It is unclear what the relative
contribution of each feature is in this model.

The Maximum Entropy tagger (MaxEnt, see Ratnaparkhi 1996) ac- 
counts for the joint distribution of PoS tags and features of a sentence with 
an exponential model. Its features are along the lines of the CRF model: 

1. Capitalization: Does the token contain a capital letter?

2. Hyphenation: Does the token contain a hyphen?

3. Numeric: Does the token contain a number?

4. Prefix: Frequent prefixes, up to 4 letters long.

5. Suffix: Frequent suffixes, up to 4 letters long.
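
The five feature types listed above could be extracted from a token roughly
as follows. This is a sketch, not Ratnaparkhi's implementation:
`token_features` is a hypothetical helper, and it returns all candidate
affixes up to four letters, leaving the selection of frequent ones to a
separate vocabulary lookup.

```python
def token_features(token, max_affix=4):
    """Orthographic flags plus candidate prefixes and suffixes of a token."""
    longest = min(max_affix, len(token))
    return {
        "capitalized": any(ch.isupper() for ch in token),
        "hyphen": "-" in token,
        "number": any(ch.isdigit() for ch in token),
        "prefixes": [token[:k] for k in range(1, longest + 1)],
        "suffixes": [token[-k:] for k in range(1, longest + 1)],
    }

print(token_features("Well-known"))
```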

In addition, Ratnaparkhi uses lexical information on frequent words within
a context of five words. The sizes of the current word, prefix, and suffix
lists were 6458, 3602 and 2925, respectively. These are supplemented
by special Previous Word vocabularies. Features frequently observed in a
training corpus are selected from a candidate feature pool. The parameters
of the model are estimated using the computationally intensive procedure of
Generalized Iterative Scaling (GIS) to maximize the conditional probability
of the training set given the model. The MaxEnt tagger has a 3.4% error rate.

Our investigation examines how much lexical information can be recovered
from sublexical features. To address this question, we reuse the feature set
of MaxEnt in a new model, which we subsequently minimize with the help
of linguistically motivated vocabularies.

2 PoS Tagging Bayesian Net 

Our tagger combines the features suggested in the literature to date into
a Dynamic Bayesian Network (DBN). We briefly introduce the essential
aspects of DBNs here and refer the reader to a recent dissertation (Murphy
2002) for an excellent survey. A DBN is a Bayesian network unwrapped in
time, such that it can represent dependencies between variables at adjacent
time slices. More formally, a DBN consists of two models $B^0$ and $B^+$,
where $B^0$ defines the initial distribution over the variables at time 0,
by specifying:

• a set of variables $X_1, \ldots, X_n$;

• a directed acyclic graph over the variables;

• for each variable $X_i$, a table specifying the conditional
probability of $X_i$ given its parents in the graph,
$\Pr(X_i \mid \mathrm{Par}(X_i))$.
The joint probability distribution over the initial state is:

$$\Pr(X_1, \ldots, X_n) = \prod_{i=1}^{n} \Pr(X_i \mid \mathrm{Par}(X_i)).$$
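
As a concrete reading of this factorization, the sketch below multiplies one
conditional-probability entry per variable. The two-variable network (a tag T
with a capitalization flag C as its child) and all its numbers are invented
for illustration, as is the dictionary layout used to store the tables.

```python
# Each variable maps to (parents, CPT). A CPT is keyed by the tuple of parent
# values and gives a distribution over the variable's own values.
# All probabilities here are made-up illustrative numbers.
network = {
    "T": ((), {(): {"NNP": 0.2, "NN": 0.8}}),
    "C": (("T",), {("NNP",): {True: 0.9, False: 0.1},
                   ("NN",): {True: 0.05, False: 0.95}}),
}

def joint_probability(network, assignment):
    """Product over variables of Pr(X_i | Par(X_i)) for a full assignment."""
    p = 1.0
    for var, (parents, cpt) in network.items():
        parent_values = tuple(assignment[q] for q in parents)
        p *= cpt[parent_values][assignment[var]]
    return p

print(joint_probability(network, {"T": "NNP", "C": True}))
```

Summing `joint_probability` over all four assignments gives 1, as expected of
a properly normalized joint distribution.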

The transition model $B^+$ specifies the conditional probability distribution
(CPD) over the state at time $t$ given the state at time $t-1$. $B^+$
consists of:

• a directed acyclic graph over the variables $X_1, \ldots, X_n$ and their
predecessors $X_1^-, \ldots, X_n^-$, which are the roots of this graph;

• conditional probability tables $\Pr(X_i \mid \mathrm{Par}(X_i))$ for all
$X_i$ (but not for $X_i^-$).

The transition probability distribution is:

$$\Pr(X_1, \ldots, X_n \mid X_1^-, \ldots, X_n^-) = \prod_{i=1}^{n} \Pr(X_i \mid \mathrm{Par}(X_i)).$$

Between them, $B^0$ and $B^+$ define a probability distribution over the
realizations of a system through time, which justifies calling these BNs
"dynamic". In our setting, the word's index in a sentence corresponds to
time, while realizations of a system correspond to correctly tagged English
sentences. Probabilistic reasoning about such a system constitutes inference.

Standard inference algorithms for DBNs are similar to those for HMMs.
Note that, while the kind of DBN we consider could be converted into an
equivalent HMM, that would render the inference intractable due to a huge
resulting state space. In a DBN, some of the variables will typically be
observed, while others will be hidden. The typical inference task is to
determine the probability distribution over the states of a hidden variable
over time, given time series data of the observed variables. This is usually
accomplished using the forward-backward algorithm. Alternatively, we might
obtain the most likely sequence of hidden variables using the Viterbi
algorithm. Either kind of inference yields the resulting PoS tags. Note that
there is no need to use "beam search" (cf. Brants 2000).
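
For readers unfamiliar with Viterbi decoding, here is a minimal log-space
sketch for a chain with a single hidden variable. It is a simplification: in
the tagger described in this paper the hidden state comprises both the PoS
and Memory variables, and inference is delegated to an existing toolkit. The
function signature, table layout, and toy probabilities are all assumptions
made for this example.

```python
import math

def viterbi(observations, states, log_init, log_trans, log_emit):
    """Most likely hidden state sequence for a first-order chain."""
    # best[t][s]: best log-probability of any path ending in state s at time t
    best = [{s: log_init[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log_trans[r][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

def logs(table):
    return {k: math.log(v) for k, v in table.items()}

# Toy model with made-up probabilities.
states = ["DT", "NN"]
log_init = logs({"DT": 0.7, "NN": 0.3})
log_trans = {"DT": logs({"DT": 0.1, "NN": 0.9}),
             "NN": logs({"DT": 0.5, "NN": 0.5})}
log_emit = {"DT": logs({"the": 0.9, "dog": 0.1}),
            "NN": logs({"the": 0.1, "dog": 0.9})}
print(viterbi(["the", "dog"], states, log_init, log_trans, log_emit))
```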

Learning the parameters of a DBN from data is generally accomplished
using the EM algorithm. However, in our model, learning is equivalent to
collecting statistics over co-occurrences of feature values and tags. This is
implemented in gawk scripts and takes minutes on the WSJ training corpus.
Compare this to GIS or IIS (Improved Iterative Scaling) used by MaxEnt.
In large DBNs, exact inference algorithms are intractable, and so a variety
of approximate methods have been developed. However, as we explain below,
the number of hidden state variables in our model is small enough to allow
exact algorithms to work. For the inference we use the standard algorithms,
as implemented in the Bayesian network toolkit (BNT, see Murphy 2002).
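
Learning by counting, as described above, amounts to relative-frequency
estimation of each conditional probability table. The sketch below shows the
idea in Python rather than gawk; `estimate_cpt` is a hypothetical name, the
toy pairs are invented, and no smoothing is applied because the text does not
describe any.

```python
from collections import Counter, defaultdict

def estimate_cpt(pairs):
    """Relative-frequency estimate of Pr(value | tag) from (tag, value) pairs."""
    counts = defaultdict(Counter)
    for tag, value in pairs:
        counts[tag][value] += 1
    return {tag: {v: n / sum(c.values()) for v, n in c.items()}
            for tag, c in counts.items()}

# Toy co-occurrences of a tag with the Capitalization feature.
pairs = [("NNP", True), ("NNP", True), ("NNP", False), ("NN", False)]
cpt = estimate_cpt(pairs)
print(cpt["NNP"][True], cpt["NN"][False])
```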




We base our original DBN on the feature set of Ratnaparkhi's MaxEnt:
the set of observable nodes in our network consists of the current word
$W_i$, a set of binary variables $C_i$, $H_i$ and $N_i$ (for Capitalization,
Hyphen and Number) and multivalued variables $P_i$ and $S_i$ (for Prefix and
Suffix), where subscript $i$ stands for position index. There are two hidden
variables: $T_i$ and $M_i$ (PoS and Memory). Memory represents contextual
information about the antepenultimate PoS tag. A special value of Memory
("Start") indicates the beginning of the sentence. The PoS values are 45 tags
of the Penn Treebank tag set (Marcus, Kim, Marcinkiewicz, MacIntyre, Bies,
Ferguson, Katz & Schasberger 1994).

Figure 1 represents dependencies among the variables. Clearly, this
model makes a few unrealistic assumptions about variable independence
and the Markov property of the sequence. Empirically this does not present a
problem. For a discussion of these issues, see Bilmes (2003), who uses
similar models for speech recognition. Thus, the probability of a complete
sequence of PoS tags $T_1 \ldots T_n$ is modeled as:

$$
\begin{aligned}
\Pr(T_1 \ldots T_n) ={}& \Pr(T_1) \times \Pr(F_1 \mid T_1) \times \Pr(T_2 \mid T_1, \mathit{Start}) \times \Pr(F_2 \mid T_2) \times \Pr(M_2 \mid T_1) \\
& \times \prod_{i=3}^{n-1} \Pr(T_i \mid T_{i-1}, M_{i-1}) \times \Pr(M_i \mid T_{i-1}, M_{i-1}) \times \Pr(F_i \mid T_i) \\
& \times \Pr(T_n \mid T_{n-1}, M_{n-1}) \times \Pr(F_n \mid T_n),
\end{aligned}
$$

where $F_i$ is a set of features at index $i \in [1..n]$ and:

$$\Pr(F_i \mid T_i) = \Pr(S_i \mid T_i) \times \Pr(P_i \mid T_i) \times \Pr(W_i \mid T_i) \times \Pr(C_i \mid T_i) \times \Pr(H_i \mid T_i) \times \Pr(N_i \mid T_i).$$

These probabilities are directly estimated from the training corpus.
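
Because the feature likelihood is a product of independent per-feature terms,
it can be evaluated as a sum of logarithms. The following sketch assumes the
tables are stored as nested dictionaries indexed by feature name, tag, and
value; the `floor` probability for unseen combinations is a placeholder
introduced here, since no smoothing scheme is specified in the text.

```python
import math

def log_feature_likelihood(features, tag, cpts, floor=1e-6):
    """Sum of log Pr(feature value | tag) over all observed features."""
    total = 0.0
    for name, value in features.items():
        # Unseen (tag, value) combinations fall back to the assumed floor.
        p = cpts[name].get(tag, {}).get(value, floor)
        total += math.log(p)
    return total

# Toy tables with made-up numbers: C = Capitalization, S = Suffix.
cpts = {
    "C": {"NNP": {True: 0.9, False: 0.1}},
    "S": {"NNP": {"on": 0.2}},
}
print(log_feature_likelihood({"C": True, "S": "on"}, "NNP", cpts))
```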

3 Experiments and Results 

We use sections 0-22 of WSJ for training and sections 23 and 24 as a final
test set. The same split of the data was used in recent publications
(Toutanova & Manning 2002; Lafferty, McCallum & Pereira 2001) that report
relatively high performance on out-of-vocabulary (OoV) items. The test
sections contain 4792 sentences out of about 55600 total sentences in the
WSJ corpus. The average length of a sentence is 23 tokens. The Brown corpus
is another part of the UPenn TreeBank dataset, which is of a similar size to
WSJ (1016277 tokens) but quite different in style and nature. The Brown
corpus has a substantially richer lexicon, and we chose it to test the
performance on novel text.

We begin our experiments by combining the original MaxEnt feature
set into a DBN we call LegoTag to emphasize its compositional nature. The
initial network achieves 3.6% error (see Table 1), closely matching that
of MaxEnt (3.4%). Our first step is to reduce the complexity of our tagger
because performing inference on the DBN containing a conditional probability
table of $45^3$ elements for the Memory variable is cumbersome. At the
cost of minor deterioration in performance (3.9%, see Table 1), we compress
the representation by clustering Memory values that predict a similar
distribution over Current tag values. The clustering method is based on
Euclidean distance between $45^2$-dimensional probability vectors
$\Pr(T_i \mid T_{i-1})$. We perform agglomerative clustering, minimizing the
sparseness of clusters (by assigning a given point to the cluster whose
farthest point it is closest to). As a result of clustering, the number of
Memory values is reduced ninefold. Consequently, the conditional probability
tables of Memory and PoS become manageable.

Table 1: Results for Full LegoTag on WSJ.
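
The merging criterion described above (join a point to the cluster whose
farthest member is nearest) reads like complete-linkage agglomerative
clustering. The sketch below implements that interpretation on toy
two-dimensional probability vectors; the function name, the stopping rule
(a fixed number of clusters `k`), and the data are assumptions made for
illustration.

```python
import math

def complete_linkage(points, k):
    """Merge clusters until k remain; cluster distance = farthest pair."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Pick the pair of clusters with the smallest farthest-pair distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Toy "next-tag distributions" for four Memory values (made-up numbers).
vectors = [(0.9, 0.1), (0.85, 0.15), (0.1, 0.9), (0.2, 0.8)]
print(complete_linkage(vectors, k=2))
```

Memory values 0 and 1 predict nearly the same distribution and end up in one
cluster, as do values 2 and 3.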

As a second step in simplifying the network, we eliminate feature
redundancy. We leave only the lowercase form of each word, prefix and
suffix in the respective vocabulary; remove numbers and hyphens from the
vocabulary; and use prefix, suffix and hyphen information only if the token
is not in the lexicon. The sizes of the factored vocabularies for word,
prefix and suffix are 5705, 2232 and 2420, respectively (a reduction of 12%,
38% and 17%). Comparing the performance of LegoTag with factored and
unfactored features clearly indicates that factoring pays off (Table 1).
Factored LegoTag is better on unknown words and at the sentence level, as
well as overall. In addition, factoring simplifies the tagger by reducing the
number of feature values.

We report three kinds of results: overall error, error on unknown words
(OoV), and per-sentence error. Our first result (Table 2) shows the
performance of our network without the variable Word. Even when all words in
the text are unknown, sublexical features carry enough information to tag
almost 89% of the corpus.

Table 2: Results of de-lexicalized and fully lexicalized LegoTag for WSJ.
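
The factoring of features described earlier in this section (lowercased
vocabularies, with prefix, suffix and hyphen consulted only for tokens
outside the lexicon) might look as follows. This is a sketch under stated
assumptions: the helper names are hypothetical, and choosing the longest
matching affix is a guess, since the text does not say how a token with
several matching affixes is handled.

```python
def longest_affix(word, vocabulary, suffix, max_len=4):
    """Longest prefix or suffix of word (up to max_len letters) in vocabulary."""
    for k in range(min(max_len, len(word)), 0, -1):
        affix = word[-k:] if suffix else word[:k]
        if affix in vocabulary:
            return affix
    return None

def factored_observation(token, lexicon, prefixes, suffixes):
    """Observed feature values for one token under the factored scheme."""
    lower = token.lower()
    obs = {"capitalized": any(ch.isupper() for ch in token),
           "number": any(ch.isdigit() for ch in token)}
    if lower in lexicon:
        obs["word"] = lower  # known word: affix and hyphen features are skipped
    else:
        obs["hyphen"] = "-" in token
        obs["prefix"] = longest_affix(lower, prefixes, suffix=False)
        obs["suffix"] = longest_affix(lower, suffixes, suffix=True)
    return obs

# Tiny made-up vocabularies.
lexicon, prefixes, suffixes = {"the"}, {"un"}, {"ly", "ing"}
print(factored_observation("The", lexicon, prefixes, suffixes))
print(factored_observation("Unhappily", lexicon, prefixes, suffixes))
```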

Next, we test two degenerate variants: one which relies only on lexical
information, and another which relies on lexical information plus
capitalization. Lexical information alone does very poorly on unknown words,
which goes to show that context alone is not enough to uncover the correct
PoS.

We now turn to the issue of using morphological cues in PoS tagging
and create a linguistically "smart" network (Smart LegoTag), whose
vocabularies contain a collection of function words, and linguistically
relevant prefixes and suffixes assembled from preparatory materials for the
English language section of the college entrance examination (Scholastic
Aptitude Test). The vocabularies are very small: 315, 100, and 72,
respectively. The percentage of unknown words depends on vocabulary size. For
the large lexicon of LegoTag it is less than 12%, while for the Smart LegoTag
(whose lexicon contains only function words, which are few but very
frequent), it is around 50%. In addition, two hybrid networks are created by
crossing the suffix set and word lexicon of the Full LegoTag and Smart
LegoTag.
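
The dependence of the unknown-word percentage on lexicon size is easy to
measure directly. The helper below is a hypothetical sketch; the sentence and
the two-word function lexicon are invented, and matching is done on lowercase
forms as an assumption.

```python
def oov_rate(tokens, lexicon):
    """Fraction of tokens whose lowercase form is not in the lexicon."""
    unknown = sum(1 for t in tokens if t.lower() not in lexicon)
    return unknown / len(tokens)

tokens = "The market fell on the news".split()
function_words = {"the", "on"}
print(oov_rate(tokens, function_words))
```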

The results for the Smart LegoTag, as well as for the Hybrid LegoTags,
are presented in Table 3. They suggest that nonlexical information is
sufficient to assure a stable, albeit not stellar, performance across
corpora. Smart LegoTag was trained on WSJ and tested on both WSJ and Brown
corpora with very similar results. The sentence accuracy is generally lower
for the Brown corpus than for the WSJ corpus, due to the difference in
average sentence length. The Hybrid LegoTag with big suffix set and small
word lexicon was a slight improvement over Smart LegoTag alone. Notably,
however, it is better on unknown words than Full LegoTag on the Brown corpus.

The best performance across corpora was registered by the second Hybrid
LegoTag (with big word lexicon and small suffix set). This is a very
interesting result indicating that the suffixes in the big lexicon that are
not linguistically relevant contain a lot of idiosyncratic information about
the WSJ corpus and are harmful to performance on different corpora.

Table 3: Results for Smart and Hybrid LegoTags.

Full LegoTag and Smart LegoTag encounter qualitatively similar difficulties.
Since function words are part of the lexicon of both networks, there is no
significant change in the success rate over function words. The biggest
source of error for both is the noun/adjective (NN/JJ) pair (19.3% of the
total error for Smart LegoTag; 21.4% of the total error for Full LegoTag).
By and large, both networks accurately classify the proper nouns, while
mislabeling adverbs as prepositions and vice versa. The latter mistake is
probably due to inconsistency within the corpus (see Ratnaparkhi 1996 for
discussion). One place where the two networks differ qualitatively is in
their treatment of verbs. Smart LegoTag often mistakes bare verb forms for
nouns. This is likely due to the fact that a phrase involving "to" and a
following word can be interpreted either analogously to "to mom" (to + NN)
or analogously to "to go" (to + VB) in the absence of lexical information.
Similar types of contexts could account for the overall increased number of
confusions of verb forms with nouns by Smart LegoTag. On the other hand,
Smart LegoTag is much better at separating bare verb forms (VB) from
present tense verbs (VBP) because it does not rely on lexical information
that is potentially confusing since both forms are identical. However, it
often fails to differentiate present tense verbs (VBP) from past tense verbs
(VBD), presumably because it does not recognize frequent irregular forms.
Adding irregular verbs to the lexicon may be a way of improving Smart
LegoTag.

4 Conclusion 

DBNs provide an elegant solution to PoS tagging. They allow flexibility 
in selecting the relevant features, representing dependencies and reducing 
the number of parameters in a principled way. Our experiments with a 
DBN tagger underline the importance of selecting an efficient feature set.
Eliminating redundancies in the feature vocabularies improves performance. 
Furthermore, reducing lexicalization leads to a higher capacity for
generalization. Delexicalized taggers make fewer errors on unknown words,
which naturally results in a more robust success rate across corpora.

The relevance of a given feature to PoS tagging varies across languages. 
For example, languages with rich morphology would call for a greater re- 
liance on suffix/prefix information. Spelling conventions may increase the 
role of the Capitalization feature (e.g. German). In the future, we hope to 
develop methods for automatic induction of efficient feature sets and adapt 
the DBN tagger to other languages.